Inducing association rules is one of the central tasks in data mining applications.
Quantitative association rules induced from databases describe rich and hidden
relationships holding within data that can prove useful for various application purposes
(e.g., market basket analysis, customer profiling, and others). Even though such
association rules are quite widely used in practice, a thorough analysis of the
computational complexity of inducing them is missing. This paper intends to provide a
contribution in this setting. To this end, we first formally define quantitative
association rule mining problems, which entail boolean association rules as a special case,
and then analyze their computational complexities, by considering both the standard
cases and some interesting special cases, namely, association rule induction over
databases with null values, fixed-size attribute set databases, sparse databases, and fixed
threshold problems.

1 Introduction 

The enormous growth of information available in database systems has pushed a signifi- 
cant development of techniques for knowledge discovery in databases. At the heart of the 
knowledge discovery process there is the application of data mining algorithms that are in 
charge of extracting hidden relationships holding among pieces of information stored in 
a given database [^]. The most widely used data mining algorithms include classification
techniques, clustering analysis and association rule induction [^]. In this paper, we focus
on the latter data mining technique. Informally speaking, an association rule states that a
conjunction of conditions implies a consequence. For instance, the rule
hamburger, fries $\Rightarrow$ soft-drink, induced from a purchase database, tells that a
customer purchasing hamburgers and fries also purchases a soft-drink.
An association rule induced from a database is interesting if it describes a relationship
that is, in a sense, "valid" as far as the information stored in the database is concerned. To
state such validity, indices are used, that is, functions with values usually in [0, 1] that
tell to what extent an extracted association rule describes knowledge valid in the database
at hand. For instance, a confidence value of 0.7 associated with the rule above tells that
70 percent of purchases including hamburgers and fries also include a soft-drink. In
the literature, several index definitions have been provided (see e.g. [^], where many
interestingness criteria are proposed). Clearly, information patterns expressed in
the form of association rules and associated indices denote knowledge that can be
useful in several application contexts, e.g., market basket analysis.

In some application contexts, however, boolean association rules, like the one above,
are not expressive enough for the purposes of the given knowledge discovery task. In
order to obtain more expressive association rules, one can allow more general forms of
conditions to occur therein. Quantitative association rules [17] are ones where both the
premise and the consequent use conditions of one of the following forms: (i) $A = u$;
(ii) $A \neq u$; (iii) $A' \in [l', u']$; (iv) $A' \notin [l', u']$, where $A$ is a categorical
attribute, i.e., an attribute that has an associated discrete, unordered domain, and $u$ is a
value in this domain, and $A'$ is a numeric attribute, that is, one associated with an ordered
domain of numbers, and $l'$ and $u'$ ($l' \leq u'$) are two, not necessarily distinct, values.
For instance, the quantitative rule

$(\mathit{hamburger} \in [2,4]) \wedge (\mathit{ice\text{-}cream\text{-}taste} = \mathit{chocolate}) \Rightarrow (\mathit{soft\text{-}drink} \in [1,3])$

induced from a purchase database, tells that a customer purchasing from 2 to 4 hamburgers
and a chocolate ice-cream also purchases from 1 to 3 soft-drinks.

In either of their forms, inducing association rules is a quite widely used data mining
technique: several systems have been developed based on them [^], and several
successful applications in various contexts have been described [^]. Despite the widespread
utilization of association rule induction in practical applications, a thorough analysis of
the complexity of the associated computational tasks has not been developed. However,
such an analysis appears to be important since, as in other contexts, an appropriate
understanding of the computational characteristics of the problem at hand makes it possible
to single out tractable cases of generally intractable problems, isolate the sources of
complexity and, overall, to devise more effective approaches to algorithm development.

As far as we know, some computational complexity analyses pertaining to association
rules are performed in [^, 19]. In [^] and [15], an NP-hardness result is
stated regarding the induction of association rules (or, in general, of conditions) having
an optimal entropy (resp. chi-square); in [19], under some restrictive assumptions, the
NP-completeness of inducing quantitative association rules with a confidence and a
support greater than two given thresholds is proved, along with a result stating a polynomial
bound on the complexity of mining quantitative rules over databases where the number of
possible items is constant. In [^], the #P-hardness of counting the number
of mined association rules (under the support measure) is stated, together with a
specialization of the result stated in Theorem 3.1 below regarding boolean association rules.
Furthermore, [20] gives some results about the computational complexity of mining frequent
itemsets under combined constraints on the number of items and on the frequency threshold.

In this paper we define a generalized form of association rules embracing the
quantitative, the categorical and the boolean types, in which null values (in the
following indicated by $\varepsilon$), denoting the absence of information, are used.

Nulls are often useful in practice. As an example, consider a market database in
which attributes correspond to available products and values represent quantities sold.
Null values can be used to denote the absence of a product in a particular transaction (this
is quite different from specifying the value 0 instead). As a further example, consider
unavailable values in medical records representing clinical cases in the analysis of patient
data. We call a database allowing null values a database with nulls.

¹Entropy, confidence and support are indices (see below).

When we induce association rules from databases with nulls, we require that condi- 
tions on attributes assuming the null value are always unsatisfied, i.e. that it is not possible 
to specify conditions on null values. A boolean association rule can be thus regarded as a 
special case of quantitative or categorical association rule mined on a database with nulls. 

In this paper, we analyze the computational complexity implied by inducing association
rules using four of the most widely used rule quality indices, namely, confidence, support,
$\theta$-gain and $h$-laplace [^]. In particular, we shall show that, in the standard case, and
depending on the chosen index of reference, the complexity of the problem is either P or
NP-complete. When databases with nulls are considered, independently of the reference
index, the rule induction task is NP-complete.

Despite these negative results, there are many cases where the problem turns out to be
very easy to solve: whenever the instance database is sparse (i.e. each transaction/tuple
is very small with respect to the set of possible attributes), or when the attribute set at hand
has constant size, for any index, we are able to show that the computational complexity
of the rule induction problem is L; furthermore, introducing some constraints on the input
instance leads to problems with very low complexity, such as $TC^0$ or $AC^0$. Problems with
this kind of complexity are very efficiently parallelizable (recall that
$AC^0 \subset TC^0 \subseteq NC^1$, whereas $L \subseteq NC^2$).

The plan of the paper is as follows. In the following section we give preliminary
definitions. In Section 3 we state general complexity results about inducing association rules.
Sparse databases and the fixed-schema complexity of rule induction are dealt with in the two
subsequent sections, respectively. Finally, the last section collects an interesting set of
special tractable cases.

2 Preliminaries 

We begin by defining several concepts that will be used throughout the paper, including, 
among others, those of association rule induction problems and indices. 

Definition 2.1 An attribute is an identifier with an associated domain. A categorical
attribute (resp., numeric attribute) is one whose domain is an unordered set of values (resp.,
a set of integer or rational numbers). Both categorical and numeric attributes include in
their domain the special value $\varepsilon$.

Let $A$ be an attribute. We denote by $dom(A)$ the domain of $A$.

Let $A$ be a categorical or numerical attribute. We say that $A$ is boolean if
$dom(A) = \{\varepsilon, c(A)\}$, where $c(A)$ denotes an arbitrary constant associated to $A$.

Definition 2.2 Let $I$ be a set of attributes. A database $T$ on $I$ is a relation with duplicates
having $I$ as set of attributes. Let $A \in I$ and let $t$ be a tuple of $T$. We denote by $t[A]$ the
value of the attribute $A$ in the tuple $t$. The size $|t|$ of $t \in T$ is
$|\{A \in I \mid t[A] \neq \varepsilon\}|$. We denote by $dom(A, T)$ the set
$\{t[A] \mid t \in T\} - \{\varepsilon\}$.

Definition 2.3 Let $I$ be a set of attributes, and let $T$ be a database on $I$. We say that $T$
is a database without nulls if, for each $t \in T$, $|t| = |I|$. Otherwise we say that $T$ is a
database with nulls.

Definition 2.4 Given a database $T$ defined on a set of attributes $I$, we call $m_T$ the longest
tuple in it. We say that $T$ is a boolean database if every attribute $A \in I$ is boolean.

A family $S$ of boolean databases is sparse if, for any $T \in S$, $|m_T|$ is $O(\log |I|)$, where
$I$ is the set of attributes which $T$ is defined on. Given a family $S$ of sparse databases, we
will call sparse database each element $T \in S$.
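As a small illustration of Definitions 2.2 to 2.4, the following Python sketch encodes a database with nulls and computes tuple sizes and attribute domains. The encoding (tuples as dictionaries, the null value as `None`) and all function names are assumptions made for this example only; they do not come from the paper.

```python
from typing import Dict, Hashable, List, Optional, Set

# Assumption: a tuple is a dict from attribute name to value, and the null
# value (written epsilon in the text) is encoded as None.
Row = Dict[str, Optional[Hashable]]


def size(t: Row) -> int:
    """|t|: number of attributes of t holding a non-null value."""
    return sum(1 for value in t.values() if value is not None)


def dom(attribute: str, T: List[Row]) -> Set[Hashable]:
    """dom(A, T): the non-null values that attribute A takes in T."""
    return {t[attribute] for t in T if t[attribute] is not None}


def is_without_nulls(I: List[str], T: List[Row]) -> bool:
    """True iff every tuple has size |I| (Definition 2.3)."""
    return all(size(t) == len(I) for t in T)


def longest_tuple_size(T: List[Row]) -> int:
    """|m_T|: size of a longest tuple of T (Definition 2.4)."""
    return max(size(t) for t in T)


I = ["hamburger", "fries", "soft_drink"]
T = [
    {"hamburger": 2, "fries": None, "soft_drink": 1},
    {"hamburger": 3, "fries": 1, "soft_drink": 2},
]
```

Here the first tuple has size 2 because its `fries` value is null, so `T` is a database with nulls.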
Definition 2.5 An atomic condition on $A$ is:

- an expression of the form $A = u$ or $A \neq u$, where $A$ is a categorical attribute and
$u$ is a value in the domain of $A$ distinct from the $\varepsilon$ value, or

- an expression of the form $A \in [l,u]$ or $A \notin [l,u]$, where $A$ is a numeric attribute
and $l$ and $u$ ($l \leq u$) are two, not necessarily distinct, numeric values.

Whenever numerical attributes are involved, the notation $A = u$ (resp. $A \neq u$) can be
regarded as a syntactic shortcut for $A \in [u,u]$ (resp. $A \notin [u,u]$).

Definition 2.6 Given a categorical attribute $A$ and a database $T$, we denote by
$dom(A = u, T)$ (resp. $dom(A \neq u, T)$) the set $dom(A, T) \cap \{u\}$ (resp.
$dom(A, T) - \{u\}$).

Let $A$ be a numerical attribute. We denote by $dom(A \in [l,u], T)$ (resp.
$dom(A \notin [l,u], T)$) the set $dom(A,T) \cap I(A,[l,u])$ (resp. $dom(A,T) - I(A,[l,u])$), where
$I(A,[l,u])$ is the set $\{x \in dom(A) \mid l \leq x \leq u\}$.

Definition 2.7 A condition $C$ on a set of distinct attributes $A_1, \ldots, A_n$ is an expression
of the form $C = C_1 \wedge \ldots \wedge C_n$, where each $C_i$ is an atomic condition on $A_i$, for each
$i = 1, \ldots, n$. We denote by $att(C)$ the set $\{A_1, \ldots, A_n\}$. The size $|C|$ of $C$ is $n$.

We are now in a position to define association rules and their semantics.

Definition 2.8 Let $I$ be a set of attributes. An association rule on $I$ is an expression of the
form $B \Rightarrow H$, where $B$ and $H$, called body and head of the rule resp., are two conditions
on the sets of attributes $I_B$ and $I_H$ resp., such that $\emptyset \subset I_B, I_H \subseteq I$, and
$I_B \cap I_H = \emptyset$. The size $|B \Rightarrow H|$ of the rule is $|B| + |H|$.

Definition 2.9 Let $I$ be a set of attributes, let $T$ be a database on $I$, and let $t$ be a tuple of
$T$. Let $A \in I$, and let $C_A$ be an atomic condition on $A$. We say that $t$ satisfies $C_A$, written
$t \vdash C_A$, iff $t[A] \in dom(C_A, T)$. Let $C = C_1 \wedge \ldots \wedge C_n$ be a condition. We say that $t$
satisfies $C$, written $t \vdash C$, iff $t \vdash C_i$, for each $i = 1, \ldots, n$. Otherwise we say that $t$
does not satisfy $C$, written $t \nvdash C$. By $T_C$ we denote the set of tuples $\{t \in T \mid t \vdash C\}$.
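The satisfaction relation can be made concrete with a short Python sketch. The representation is an assumption made for illustration: an atomic condition is a triple (attribute, operator, argument) with operators `"eq"`, `"ne"`, `"in"` and `"notin"`, a condition is a list of such triples, and the null value is `None`. Since the set of values admitted by an atomic condition never contains the null value, a null attribute value satisfies no atomic condition.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Optional[Any]]
Atomic = Tuple[str, str, Any]  # (attribute, operator, argument); hypothetical encoding


def satisfies_atomic(t: Row, cond: Atomic) -> bool:
    """t satisfies an atomic condition; a null (None) value never does."""
    attribute, op, arg = cond
    value = t[attribute]
    if value is None:
        return False
    if op == "eq":
        return value == arg
    if op == "ne":
        return value != arg
    if op in ("in", "notin"):
        low, high = arg
        inside = low <= value <= high
        return inside if op == "in" else not inside
    raise ValueError(f"unknown operator: {op}")


def satisfies(t: Row, condition: List[Atomic]) -> bool:
    """t satisfies a conjunction iff it satisfies every atomic condition."""
    return all(satisfies_atomic(t, c) for c in condition)


def select(T: List[Row], condition: List[Atomic]) -> List[Row]:
    """T_C: the tuples of T (with duplicates) satisfying condition C."""
    return [t for t in T if satisfies(t, condition)]


T = [
    {"hamburger": 2, "taste": "chocolate", "soft_drink": 1},
    {"hamburger": 3, "taste": "chocolate", "soft_drink": None},
    {"hamburger": 5, "taste": "chocolate", "soft_drink": 2},
]
body = [("hamburger", "in", (2, 4)), ("taste", "eq", "chocolate")]
head = [("soft_drink", "in", (1, 3))]
```

In this toy database two tuples satisfy the body, but only the first also satisfies the head, because the second has a null `soft_drink` value.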

Definition 2.10 Let $I$ be a set of attributes, let $T$ be a database on $I$, and let $C$ be
a condition on a subset of $I$. We say that $C$ is trivial if it contains at least an atomic
condition $C_A$ such that $T_{C_A} = T$. Let $B \Rightarrow H$ be an association rule on $I$. We say that
$B \Rightarrow H$ is trivial if $B \wedge H$ is trivial.

Trivial rules with a suitable value of interest can be easily built. Thus, in the following,
we will focus our attention on non-trivial association rules.

When inducing association rules from databases in data mining applications, one is 
usually interested in obtaining rules that describe knowledge "largely" valid in the given 
database. This concept is captured by several notions of indices, which have been defined 
in the literature. In the following, we shall consider the most widely used of them, whose 
definitions are given next. 

Definition 2.11 Let $I$ be a set of attributes, let $T$ be a database on $I$, and let
$B \Rightarrow H$ be an association rule on $I$. Then:

1. the support of $B \Rightarrow H$ in $T$, written $sup(B \Rightarrow H, T)$, is
$\frac{|T_{B \wedge H}|}{|T|}$;

2. the confidence of $B \Rightarrow H$ in $T$, written $cnf(B \Rightarrow H, T)$, is
$\frac{|T_{B \wedge H}|}{|T_B|}$;

3. Let $\theta$ be a rational number, $0 < \theta < 1$, then the $\theta$-gain of $B \Rightarrow H$ in $T$,
written $gain_\theta(B \Rightarrow H, T)$, is $\frac{|T_{B \wedge H}| - \theta |T_B|}{|T|}$;

4. Let $h$ be a natural, $h \geq 2$, then the $h$-laplace of $B \Rightarrow H$ in $T$, written
$laplace_h(B \Rightarrow H, T)$, is $\frac{|T_{B \wedge H}| + 1}{|T_B| + h}$.
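The four indices depend only on three counts: the number of tuples satisfying body and head, the number satisfying the body, and the size of the database. The sketch below assumes the usual count-based forms of support, confidence, θ-gain and h-laplace (ratios |T_{B∧H}|/|T| and |T_{B∧H}|/|T_B|, the difference |T_{B∧H}| − θ|T_B| divided by |T|, and (|T_{B∧H}| + 1)/(|T_B| + h)); the function and parameter names are illustrative. Exact rationals are used so that comparisons against a threshold s are not affected by rounding.

```python
from fractions import Fraction

# n_bh = |T_{B and H}|, n_b = |T_B|, n_t = |T| (parameter names are illustrative).


def support(n_bh: int, n_t: int) -> Fraction:
    return Fraction(n_bh, n_t)


def confidence(n_bh: int, n_b: int) -> Fraction:
    # Undefined (ZeroDivisionError) when no tuple satisfies the body.
    return Fraction(n_bh, n_b)


def gain(n_bh: int, n_b: int, n_t: int, theta: Fraction) -> Fraction:
    return (Fraction(n_bh) - Fraction(theta) * n_b) / n_t


def laplace(n_bh: int, n_b: int, h: int) -> Fraction:
    return Fraction(n_bh + 1, n_b + h)
```

For example, with 10 tuples, 5 satisfying the body and 4 satisfying body and head, support is 2/5, confidence is 4/5, the 1/2-gain is 3/20 and the 2-laplace is 5/7.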

Now that we have defined association rules and associated indices (that, in different forms,
measure the validity of an association rule w.r.t. the database from which it has been
induced), we are in a position to formally define the association rule induction problems.

Definition 2.12 Let $I$ be a set of attributes, let $T$ be a database on $I$, let $k$, $1 < k < |I|$,
be a natural number, and let $s$, $0 < s < 1$, be a rational number. Furthermore, let
$\rho \in \{sup, cnf, laplace_h, gain_\theta\}$. The association rule induction problem
$(I, T, \rho, k, s)$ is as follows: Is there a non-trivial association rule $R$ such that
$|R| \geq k$ and $\rho(R, T) \geq s$?

In general, we shall thus measure the complexity of association rule induction problems 
for the various index forms we have defined above. As a special case, we shall also con- 
sider the complexity of the induction problems when the attribute set / is assumed to be 
not part of the input, in which case we will talk about fixed schema complexity of the 
association rule induction problem. 

Remark. In the literature it is usually assumed that, in answering an association rule
induction problem, one looks for rules which match some bounds in terms of two or more
indices [^]. Here we preferred to split the problem so as to refer to one index at a time.
Indeed, this allows us to single out complexity sources more precisely and, moreover,
complexity measures for problems involving more than one index can be obtained fairly
easily from problems involving only one index.



2.1 Complexity Classes 

We assume the reader is familiar with basic concepts regarding computational complexity 
and, in particular, the complexity classes P (the decision problems solved by polynomial- 
time bounded deterministic Turing machines), NP (the decision problems solved by poly- 
nomial-time bounded non-deterministic Turing machines) and L (the decision problems 
solved by logspace-bounded deterministic Turing machines). 

Definition 2.13 MAJORITY gates are unbounded fan-in gates (with binary input and 
output) that output 1 if and only if more than half of their inputs are non-zero. 
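A MAJORITY gate as just defined can be mimicked by a one-line function. This is only an illustration of the gate's input/output behaviour (the name `majority` is chosen here), not a model of constant-depth circuits.

```python
from typing import Sequence


def majority(bits: Sequence[int]) -> int:
    """Output 1 iff more than half of the inputs are non-zero."""
    ones = sum(1 for b in bits if b != 0)
    return 1 if 2 * ones > len(bits) else 0
```

Note that a tie (exactly half of the inputs set) yields 0, since the definition requires strictly more than half.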



Definition 2.14 A family $\{C_i\}$ of boolean circuits, s.t. $C_i$ accepts strings of size $i$, is
uniform if there exists a Turing machine $T$ which on input $i$ produces the circuit $C_i$.
$\{C_i\}$ is said to be logspace uniform if $T$ carries out its work using $O(\log i)$ space. Define
$AC^0$ (resp. $TC^0$) as the class of decision problems solved by uniform families of circuits
of polynomial size and constant depth, with AND, OR, and NOT (resp. MAJORITY and
NOT) gates of unbounded fan-in [^].

Definition 2.15 For any $k > 0$, $\#AC^0_k$ is the class of functions $f : \{0,1\}^* \to \mathbb{N}$
computed by depth $k$, polynomial size uniform families of circuits with $+$, $\times$-gates (the usual
arithmetic sum and product in $\mathbb{N}$) having unbounded fan-in, where each value incoming
into the circuit can be either a constant (where the allowed constant values are 1 and 0) or
an input value in the form $x_i$ or $1 - x_i$ (where the allowed input values are 1 and
0). Let $\#AC^0 = \bigcup_{k>0} \#AC^0_k$ [^].

Thus, $\#AC^0$ circuits accept the values 1 and 0 as inputs, but they are considered as natural
numbers.

Definition 2.16 $GapAC^0$ is the class of all functions $f : \{0,1\}^* \to \mathbb{N}$ that can be
expressed as the difference of two functions in $\#AC^0$ [^]. $PAC^0$ is the class of languages

$\{A \mid \exists f \in GapAC^0,\ x \in A \Leftrightarrow f(x) > 0\}$

3 General complexity results 

Here we investigate the complexity of evaluating $(I, T, \rho, k, s)$ when $I$, $T$, $k$ and $s$ are all
taken as input values.

Definition 3.1 Let $I$ be a set of numerical attributes, and let $T$ be a database on $I$. Let $A$
be an attribute in $I$, and let $u$ be a value. Define

- $lub(u, A, T) = \min\{v \in dom(A, T) \mid v \geq u\}$, and

- $glb(u, A, T) = \max\{v \in dom(A, T) \mid v \leq u\}$.

Let $C = A \in [l,u]$ (resp. $C = A \notin [l,u]$) be a non trivial atomic condition such that
$|T_C| > 0$. Define

$bot(C, T) = A \in [lub(l, A, T), glb(u, A, T)]$

($A \notin [lub(l, A, T), glb(u, A, T)]$ resp.)

Let $C = C_1 \wedge \ldots \wedge C_n$ be a non trivial condition such that $|T_C| > 0$. Define

$bot(C, T) = bot(C_1, T) \wedge \ldots \wedge bot(C_n, T)$

Proposition 3.1 Let $I$ be a set of numerical attributes, let $T$ be a database on $I$, and let
$C$ be a non trivial condition on a subset of $I$ such that $|T_C| > 0$. Then $T_C = T_{bot(C,T)}$.

Proof. Straightforward. □



Proposition 3.1 has the technically important consequence that we can restrict our 
attention to conditions and association rules including only values from the database of 
interest. 
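The tightening of interval endpoints to values occurring in the database can be sketched in Python for a single condition of the form A ∈ [l, u]. The dictionary encoding with `None` as the null value and the helper names `bot_in` and `select_in` are assumptions of this example; `lub` and `glb` follow the definitions of least value not below, and greatest value not above, a given bound.

```python
from typing import Dict, List, Optional, Set, Tuple

Row = Dict[str, Optional[int]]


def dom(attribute: str, T: List[Row]) -> Set[int]:
    """Non-null values taken by the attribute in T."""
    return {t[attribute] for t in T if t[attribute] is not None}


def lub(u: int, attribute: str, T: List[Row]) -> int:
    """Least database value that is >= u."""
    return min(v for v in dom(attribute, T) if v >= u)


def glb(u: int, attribute: str, T: List[Row]) -> int:
    """Greatest database value that is <= u."""
    return max(v for v in dom(attribute, T) if v <= u)


def bot_in(attribute: str, low: int, high: int, T: List[Row]) -> Tuple[int, int]:
    """Endpoints of bot(A in [low, high], T)."""
    return lub(low, attribute, T), glb(high, attribute, T)


def select_in(attribute: str, low: int, high: int, T: List[Row]) -> List[Row]:
    """Tuples whose (non-null) attribute value lies in [low, high]."""
    return [t for t in T
            if t[attribute] is not None and low <= t[attribute] <= high]


T = [{"qty": 1}, {"qty": 4}, {"qty": 7}, {"qty": None}]
tight = bot_in("qty", 2, 8, T)
```

On this toy database the interval [2, 8] is tightened to [4, 7], and both intervals select the same tuples, which is the content of the proposition for this special case.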

Now we prove that, when support is assumed as the reference index, the association
rule mining problem is NP-complete both in the presence and in the absence of nulls. We point
out that the following result extends the two more specific results presented in [^], which
applies only to boolean databases (there called 0/1-relations), and in [^], which applies
only to numerical databases without nulls and to conditions on intervals containing at
least two distinct numbers.

Proposition 3.2 Consider the problem $\mathcal{P} = (I, T, sup, k, s)$. If there exists a rule
$B \Rightarrow H$ that is a solution for $\mathcal{P}$, then for each $k'$, $1 < k' \leq k$, there exists a rule
$B' \Rightarrow H'$ of size $k'$ such that $sup(B' \Rightarrow H', T) \geq s$.
Theorem 3.1 Given a database $T$ without nulls, the problem $(I, T, sup, k, s)$ is
NP-complete.

Proof. (Hardness) The proof is by reduction of the problem CLIQUE, which is well-known
to be NP-complete [^]. Let $G = (V, E)$ be an undirected graph, with set of nodes
$V = \{v_1, \ldots, v_n\}$ and set of edges $E = \{e_1 = \{v_{p_1}, v_{q_1}\}, \ldots, e_m = \{v_{p_m}, v_{q_m}\}\}$. Let
$k$ be an integer. The CLIQUE problem is: Does there exist in $G$ a complete subgraph
(clique) of size at least $k$?

W.l.o.g. suppose the graph $G$ is connected. We build an instance of $(I, T, sup, k, s)$
as follows: let $I^{clq}$ be the set consisting of the attributes $I_1, \ldots, I_n$, so that $I_j$ represents
the node $v_j$ of $G$, for each $j = 1, \ldots, n$. Let $T^{clq}$ be the database on $I^{clq}$ formed by
a tuple $t_{e_i}$, for each $i = 1, \ldots, m$, such that $t_{e_i}[I_j] = 0$ if $v_j \in e_i$, and $t_{e_i}[I_j] = 1$
otherwise ($t_{e_i}$ encodes the edge $e_i$ of $G$). Next, we prove that $G$ has a clique of size $k$
iff $(I^{clq}, T^{clq}, sup, n - k, \binom{k}{2}/m)$ is a YES instance.
We have the following fact.

Fact 3.1 Let $I_j \in I^{clq}$, let $C' = (I_j = 0)$ (or, equivalently, $C' = (I_j \neq 1)$), and let $C''$ be
a non trivial condition defined on a subset of $I^{clq} - \{I_j\}$. Then
$|T^{clq}_{C' \wedge C''}| \leq n - |C' \wedge C''|$.

We can now resume the proof of the theorem.

($\Rightarrow$) Let $C = \{v_{r_1}, \ldots, v_{r_k}\}$ be a clique of size $k$ in $G$. Consider the condition

$B \wedge H = \bigwedge_{v_j \in (V - C)} (I_j = 1)$

Since $G$ is connected, $B \wedge H$ is non trivial. By definition of clique, there exist $\frac{k(k-1)}{2}$
edges of $G$ connecting nodes in $C$. Therefore, the cardinality of

$T' = \{t_{\{v_{r_x}, v_{r_y}\}} \in T^{clq} \mid 1 \leq x < y \leq k\}$

equals $\binom{k}{2}$. Clearly $T' \subseteq T^{clq}_{B \wedge H}$ and $sup(B \Rightarrow H, T^{clq}) \geq \binom{k}{2}/m$.

($\Leftarrow$) By Proposition 3.2, if $(I^{clq}, T^{clq}, sup, n - k, \binom{k}{2}/m)$ is a YES instance then there
exists a non trivial rule $B \Rightarrow H$ of size $n - k$ such that $|T^{clq}_{B \wedge H}| \geq \binom{k}{2}$.

First, we note that atomic conditions on numerical attributes of the form $I_j \in [0, 1]$ are
trivial, while the same does not apply to categorical attributes. W.l.o.g. assume $k \geq 4$.
By contradiction, suppose that there exists a condition $I_j = 0$ (or $I_j \neq 1$) occurring in
$B \Rightarrow H$; then, by Fact 3.1, $|T^{clq}_{B \wedge H}| \leq k < \binom{k}{2}$. Hence only conditions of the form
$I_j = 1$ ($I_j \neq 0$) can appear in $B \Rightarrow H$.

Let $I^{clq} - att(B \wedge H) = \{I_{r_1}, \ldots, I_{r_k}\}$. In order to have $|T^{clq}_{B \wedge H}| \geq \binom{k}{2}$,
$T^{clq}_{B \wedge H}$ must contain, at least, the set

$\{t_{\{v_{r_x}, v_{r_y}\}} \in T^{clq} \mid 1 \leq x < y \leq k\}$

i.e. the nodes $v_{r_1}, \ldots, v_{r_k}$ form a clique of $G$ having size $k$.

(Membership) Certificate: an association rule $B \Rightarrow H$ on a subset of $I$. Polynomial
checking: verify that $B \Rightarrow H$ is non trivial, that $|B \Rightarrow H| \geq k$, and that
$sup(B \Rightarrow H, T) \geq s$. □
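The reduction from CLIQUE can be exercised on small graphs with the following Python sketch. It builds one 0/1 tuple per edge (0 on the two endpoints, 1 elsewhere) and then searches, by brute force, for n − k attributes whose conditions "attribute = 1" are jointly satisfied by at least k(k−1)/2 tuples. The function names and the 0-based node numbering are assumptions of this example; the search is exponential and only meant to illustrate the encoding. The sketch does not check non-triviality and considers only conditions of the form "attribute = 1", which the proof shows to be the only useful ones for k ≥ 4.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def build_clique_db(n: int, edges: Sequence[Edge]) -> List[Tuple[int, ...]]:
    """One tuple per edge: value 0 on the edge's endpoints, 1 elsewhere."""
    return [tuple(0 if j in e else 1 for j in range(n)) for e in edges]


def count_all_ones(T: Sequence[Tuple[int, ...]], attrs: Sequence[int]) -> int:
    """Number of tuples with value 1 on every attribute in attrs."""
    return sum(1 for t in T if all(t[j] == 1 for j in attrs))


def clique_via_support(n: int, edges: Sequence[Edge], k: int) -> bool:
    """Brute-force check of the reduced instance (illustration only)."""
    T = build_clique_db(n, edges)
    needed = k * (k - 1) // 2
    return any(count_all_ones(T, attrs) >= needed
               for attrs in combinations(range(n), n - k))


# A 4-clique on nodes 0..3 with a path 3-4-5 attached.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
with_clique = k4 + [(3, 4), (4, 5)]
without_clique = [e for e in with_clique if e != (0, 1)]
```

With k = 4, the first graph yields a YES answer (the conditions on nodes 4 and 5 leave exactly the six clique edges), while removing the edge (0, 1) leaves at most five tuples for any choice of two attributes.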

Figure 1: An example of the reduction used in Theorem 3.1

Theorem 3.2 Given a database $T$ with nulls, the problem $(I, T, sup, k, s)$ is
NP-complete.

Proof. (Sketch) The proof uses the same line of reasoning as in Theorem 3.1. However,
this time, we use $\varepsilon$ values instead of 0 values in the reduction. Furthermore, we note that
rules including conditions of the form $I_j \neq 1$ imply that the value of the support is 0,
hence only conditions of the form $I_j = 1$ can be taken into account. □

It is generally believed that when both support and confidence are measured, the latter
task (i.e. filtering out rules with low confidence value from a set of rules having support
above some threshold) is far easier to compute [^]. We prove next that the problem
of finding association rules having high confidence on databases without nulls is
tractable, while the same problem on databases with nulls presents per se some
computational difficulty.

Lemma 3.1 Let $I$ be a set of attributes, let $T$ be a database without nulls on $I$, and let
$0 < s < 1$ be a rational. Then there exists a non trivial association rule $B \Rightarrow H$ on $I$ such
that $cnf(B \Rightarrow H, T) \geq s$ iff there exist an attribute $J_H \in I$, a value $u_H \in dom(J_H, T)$,
and a tuple $t \in T$, such that the rule

$\left( \bigwedge_{J \in (I - \{J_H\})} (J = t[J]) \right) \Rightarrow (J_H \neq u_H)$

is non trivial and has confidence greater than or equal to $s$.
Proof. ($\Rightarrow$) Let

$(B \Rightarrow H) = (C_1 \wedge \ldots \wedge C_h \Rightarrow C_{h+1} \wedge \ldots \wedge C_k)$

where $C_i$ is an atomic condition, for each $i = 1, \ldots, k$. Let $J_H = att(C_k)$, and let
$u_H \in (dom(J_H, T) - dom(C_k, T))$. Since $C_k$ is non trivial, $u_H$ always exists. Consider
the rule

$(B' \Rightarrow H') = (C_1 \wedge \ldots \wedge C_{k-1} \Rightarrow (J_H \neq u_H))$

Then

$cnf(B' \Rightarrow H', T) \geq cnf(B \Rightarrow H, T) \geq s$

Let $I - \{J_H\} = \{J_1, \ldots, J_{n-1}\}$. For each $t \in T$, we denote by $C(t)$ the condition

$(J_1 = t[J_1]) \wedge \ldots \wedge (J_{n-1} = t[J_{n-1}])$

Let $T'$ be a maximal subset of $T_{B'}$ such that for each $t \in T'$ there does not exist
$t' \in T' - \{t\}$ such that $(t[J_1] = t'[J_1]) \wedge \ldots \wedge (t[J_{n-1}] = t'[J_{n-1}])$.

We show that there exists $t \in T'$ such that $\frac{|T_{C(t) \wedge H'}|}{|T_{C(t)}|} \geq s$. Assume by
contradiction that, for each $t \in T'$, $\frac{|T_{C(t) \wedge H'}|}{|T_{C(t)}|} < s$. Then

$cnf(B' \Rightarrow H', T) = \frac{|\bigcup_{t'' \in T'} T_{C(t'') \wedge H'}|}{|\bigcup_{t'' \in T'} T_{C(t'')}|} = \frac{\sum_{t'' \in T'} |T_{C(t'') \wedge H'}|}{\sum_{t'' \in T'} |T_{C(t'')}|} < \frac{\sum_{t'' \in T'} s\,|T_{C(t'')}|}{\sum_{t'' \in T'} |T_{C(t'')}|} = s$

But $cnf(B' \Rightarrow H', T) \geq s$. Then there exists $t \in T'$ such that
$\frac{|T_{C(t) \wedge H'}|}{|T_{C(t)}|} \geq s$. Hence $C(t) \Rightarrow H'$ is the required rule. Finally, we note
that the rule is clearly non trivial.

($\Leftarrow$) Straightforward. □

For each i = 1, ..., |I|, consider the ith attribute J_i of I;
    Build the ordered database T' by sorting T
    w.r.t. the sequence J_1, ..., J_{i-1}, J_{i+1}, ..., J_n, J_i;
    For each block B of adjacent tuples of T'
    that are identical on the attributes I - {J_i};
        Determine the value b = min_{v in dom(J_i, T)} |{t in B | t[J_i] = v}|;
        If (|B| - b)/|B| >= s then return "yes";
Return "no";

Figure 2: The algorithm deciding the confidence problem on databases without nulls

Theorem 3.3 Given a database $T$ without nulls, the problem $(I, T, cnf, k, s)$ is in P.

Proof. (Sketch) The problem can be solved in time $O(|I| \cdot |T|^2 \log |T|)$ by testing if there
exists an association rule of the form described in Lemma 3.1, with confidence reaching
the threshold $s$. Figure 2 reports the algorithm deciding the problem $(I, T, cnf, k, s)$ on
databases without nulls. □

Proposition 3.3 Consider the problem $\mathcal{P} = (I, T, cnf, k, s)$. If there exists an
association rule $B \Rightarrow H$ that is a solution for $\mathcal{P}$, then the rule $B' \Rightarrow H'$ also solves
$\mathcal{P}$, where $B' \wedge H' = B \wedge H$ and $|H'| = 1$.
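The procedure of Figure 2 can be sketched in Python as follows. This version groups tuples with a hash map instead of sorting them, which is a substitution made here for brevity and not the formulation used in the figure; the function name and the dictionary encoding of tuples are likewise assumptions. As in the figure, the sketch does not check non-triviality of the body conditions.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from typing import Dict, Hashable, List, Sequence


def exists_confident_rule(attrs: Sequence[str],
                          T: List[Dict[str, Hashable]],
                          s: Fraction) -> bool:
    """Decide whether some rule (J = t[J] for all J != J_H) => (J_H != u)
    reaches confidence s on a database T without nulls."""
    for head in attrs:
        others = [a for a in attrs if a != head]
        head_values = {t[head] for t in T}  # dom(J_H, T)
        # Each block collects the tuples that agree on all other attributes.
        blocks: Dict[tuple, Counter] = defaultdict(Counter)
        for t in T:
            blocks[tuple(t[a] for a in others)][t[head]] += 1
        for counts in blocks.values():
            block_size = sum(counts.values())
            # Least frequent head value in the block (possibly absent: 0).
            b = min(counts.get(v, 0) for v in head_values)
            if Fraction(block_size - b, block_size) >= s:
                return True
    return False


# All four combinations of two binary attributes: every block is balanced.
balanced = [{"a": x, "b": y} for x in (1, 2) for y in (1, 2)]
# Two tuples in which "a" determines "b".
functional = [{"a": 1, "b": 1}, {"a": 2, "b": 2}]
```

On `balanced` the best achievable confidence is 1/2, so the answer is negative for s = 3/4. On `functional` the rule (a = 1) ⇒ (b ≠ 2) has confidence 1.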

Theorem 3.4 Given a database $T$ with nulls, the problem $(I, T, cnf, k, s)$ is
NP-complete.

Proof. (Hardness) The proof, as in Theorem 3.1, is by reduction of CLIQUE. Let
$G = (V, E)$ be an undirected graph, with set of nodes $V = \{v_1, \ldots, v_n\}$ and set of edges
$E = \{e_1 = \{v_{p_1}, v_{q_1}\}, \ldots, e_m = \{v_{p_m}, v_{q_m}\}\}$. We build an instance of $(I, T, cnf, k, s)$
as follows.

Let $I^{clq}$ be $I' \cup \{I_{n+1}\}$, where $I' = \{I_1, \ldots, I_n\}$, $I_j$ represents the node $v_j$ of $G$,
for $j = 1, \ldots, n$, and $I_{n+1}$ is a new attribute representing a new node $v_{n+1}$. Let
$T^{clq} = T' \cup T''$, where $T'$ includes the tuples $t_{e_i}$ and $t'_{e_i}$, where $t_{e_i}[I_j] = \varepsilon$ (resp.
$t'_{e_i}[I_j] = \varepsilon$) if $v_j \in e_i$, and $t_{e_i}[I_j] = 1$ (resp. $t'_{e_i}[I_j] = 1$) otherwise, for
$i = 1, \ldots, m$, $j = 1, \ldots, n+1$ (the tuples $t_{e_i}$ and $t'_{e_i}$ both denote the edge $e_i$ of $G$).

Furthermore, $T''$ includes the tuples $t_{v_i}$, where $t_{v_i}[I_j] = \varepsilon$ if $i = j$, and $t_{v_i}[I_j] = 1$
otherwise, for $i = 1, \ldots, n+1$, $j = 1, \ldots, n+1$. Next, we prove that there exists a
clique of size $k$ in $G$ iff $(I^{clq}, T^{clq}, cnf, n - k + 1, \frac{k^2}{k^2+1})$ is a YES instance.

Fact 3.2 Let $C$ be a condition on a subset of $I'$, then $|T'_C| \leq 2\binom{n - |C|}{2}$ and
$|T''_C| \leq n + 1 - |C|$.

($\Rightarrow$) Let $C = \{v_{r_1}, \ldots, v_{r_k}\}$ be a clique of size $k$ in $G$. Consider the condition
$B = \bigwedge_{v_j \in (V - C)} (I_j = 1)$, such that $|B| = n - k$. By definition of clique, there exist
$\frac{k(k-1)}{2}$ edges of $G$ connecting nodes in $C$. Now,

$T'_B = \{t_{\{v_{r_x}, v_{r_y}\}}, t'_{\{v_{r_x}, v_{r_y}\}} \mid 1 \leq x < y \leq k\}$

Thus, $|T'_B| = 2\binom{n - |B|}{2} = k(k - 1)$, whereas $|T''_B| = n + 1 - |B| = k + 1$. Hence,
$|T^{clq}_B| = k(k - 1) + (k + 1) = k^2 + 1$, and

$cnf(B \Rightarrow (I_{n+1} = 1), T^{clq}) = \frac{|T^{clq}_{B \wedge (I_{n+1} = 1)}|}{|T^{clq}_B|} = \frac{k^2}{k^2 + 1}$

($\Leftarrow$) If $(I^{clq}, T^{clq}, cnf, n - k + 1, \frac{k^2}{k^2+1})$ is a YES instance, then there exists
$B \Rightarrow H$ on $I^{clq}$, with $|H| = 1$, such that $|B| \geq n - k$ (by Proposition 3.3).

First, we note that the presence in the rule of atomic conditions of the form $I_j \neq 1$
implies that the rule has confidence 0. Hence only atomic conditions of the form $I_j = 1$
can appear in $B \Rightarrow H$.

The content of $T''$ implies that there is no association rule having confidence 1 on $T^{clq}$.
Furthermore, we can infer that $|T^{clq}_B| \geq k^2 + 1$, otherwise the ratio
$\frac{|T^{clq}_{B \wedge H}|}{|T^{clq}_B|}$ would not be greater than or equal to $\frac{k^2}{k^2+1}$. Two cases are to be
considered: (a) $I_{n+1} \notin att(B)$; (b) $I_{n+1} \in att(B)$.

(Case a) Assume that $att(B) \subseteq I'$. Then $|T^{clq}_B| \geq k^2 + 1$ implies that $|B| \leq n - k$, and
we have already noticed that $|B| \geq n - k$. Thus $|B| = n - k$ and $|T^{clq}_B| = k^2 + 1$. Let
$I' - att(B) = \{I_{r_1}, \ldots, I_{r_k}\}$. Since $|B| = n - k$, then $|T''_B| = k + 1$, whereas, in order
to have $|T'_B| = k(k - 1)$, it is necessary that

$T'_B = \{t_{\{v_{r_x}, v_{r_y}\}}, t'_{\{v_{r_x}, v_{r_y}\}} \mid 1 \leq x < y \leq k\}$

Thus $E \supseteq \{\{v_{r_x}, v_{r_y}\} \mid 1 \leq x < y \leq k\}$, and the nodes $v_{r_1}, \ldots, v_{r_k}$ form a clique
of $G$ having size $k$.

(Case b) Suppose that $B = B' \wedge (I_{n+1} = 1)$. Then $|T^{clq}_B| \geq k^2 + 1$ implies that
$|B'| \leq n - k - 1$, and we have already noticed that $|B| \geq n - k$, i.e. $|B'| \geq n - k - 1$.
Thus $|B'| = n - k - 1$ and (by recalling Fact 3.2)

$k^2 + 1 \leq |T^{clq}_{B'}| \leq 2\binom{n - |B'|}{2} + (n + 1 - |B'|) = k^2 + 2k + 2$

We can show that there does not exist a tuple $t \in T'$ such that $t \nvdash H$ and $t \vdash B$. Assume,
by contradiction, that such a tuple $t \in T'$ exists. Then $|T^{clq}_{B \wedge H}| \leq |T^{clq}_{B'}| - 3$. This
implies that the confidence of the association rule $B \Rightarrow H$ cannot be greater than or equal to
$k^2/(k^2 + 1)$, since

$(\forall k) \quad \frac{|T^{clq}_{B'}| - 3}{|T^{clq}_{B'}|} \leq \frac{(k^2 + 2k + 2) - 3}{k^2 + 2k + 2} < \frac{k^2}{k^2 + 1}$

Thus, $H$ is such that $|T'_{B \wedge H}| = |T'_B| = |T'_{B'}| = |T'_{B' \wedge H}|$. Since $|T^{clq}_B| \geq k^2 + 1$ and, by
Fact 3.2, we know that $|T^{clq}_{B' \wedge H}| \leq k^2 + 1$ (note that $|B' \wedge H| = n - k$), it follows that
$|T^{clq}_{B' \wedge H}| = |T^{clq}_B| = k^2 + 1$. Let $I' - att(B \wedge H) = \{I_{r_1}, \ldots, I_{r_k}\}$. Hence

$T'_B = \{t_{\{v_{r_x}, v_{r_y}\}}, t'_{\{v_{r_x}, v_{r_y}\}} \mid 1 \leq x < y \leq k\}$

Thus $E \supseteq \{\{v_{r_x}, v_{r_y}\} \mid 1 \leq x < y \leq k\}$, and the nodes $v_{r_1}, \ldots, v_{r_k}$ form a clique
of $G$ having size $k$.

(Membership) Certificate: an association rule $B \Rightarrow H$ on $I$. Polynomial checking: verify
that $B \Rightarrow H$ is non trivial, $|B \wedge H| \geq k$, and $cnf(B \Rightarrow H, T) \geq s$. □



Figure 3: An example of the reduction used in Theorem 3.4

Despite the syntactical similarity with confidence, the laplace metric is closer to support
than to confidence. Consider the laplace expression. For each rule $B \Rightarrow H$, database $T$, and
fixed value of $|T_B|$, laplace is maximum when $|T_{B \wedge H}| = |T_B|$. Assume the above relation
is satisfied. In order to have $laplace_h(B \Rightarrow H, T) \geq s$, it must be the case that
$|T_{B \wedge H}| \geq \frac{sh - 1}{1 - s}$. Assume now that $s \to 1$; this implies that $|T_{B \wedge H}| \to \infty$, i.e. that
$sup(B \Rightarrow H, T) \to 1$. The following theorem formalizes the above intuitive argument.



Theorem 3.5 Let $T$ be a database without nulls. Then the problem $(I, T, \rho, k, s)$,
with $\rho \in \{gain_\theta, laplace_h\}$, is NP-complete.

Proof. (Hardness) Once again, for the hardness part, we use a reduction of CLIQUE.
Thus, let $G = (V, E)$ be an undirected graph, with set of nodes $V = \{v_1, \ldots, v_n\}$ and
set of edges $E = \{e_1 = \{v_{p_1}, v_{q_1}\}, \ldots, e_m = \{v_{p_m}, v_{q_m}\}\}$. Let $I^{clq}$ be the set of
attributes $I_1, \ldots, I_n, I_{n+1}$, where $I_j$ denotes the node $v_j$ of $G$ ($j = 1, \ldots, n$) and $I_{n+1}$
is an additional attribute. Furthermore, let $T^{clq}$ include the tuples $t_{e_i}, t'_{e_i}$ s.t.
$t_{e_i}[I_j] = t'_{e_i}[I_j] = 0$ if $v_j \in e_i$, and 1 otherwise, where $t_{e_i}$ and $t'_{e_i}$ both denote the
edge $e_i$ of $G$, for each $i = 1, \ldots, m$, and the tuple $t_0$, s.t. $t_0[I_j] = 0$ for each
$j = 1, \ldots, n + 1$. Let $s^{clq}$ be $\frac{k(k-1) - \theta k(k-1)}{2m+1}$ ($\frac{k(k-1)+1}{k(k-1)+h}$ resp.). Next we
prove that there exists a clique of size $k$ in $G$ iff $(I^{clq}, T^{clq}, gain_\theta, n - k + 1, s^{clq})$
($(I^{clq}, T^{clq}, laplace_h, n - k + 1, s^{clq})$ resp.) is a YES instance.
We have the following facts.

Fact 3.3 Let $I_j \in I^{clq}$, let $C' = (I_j = 0)$ or $C' = (I_j \neq 1)$, and let $C''$ be a non trivial
condition on a subset of $I^{clq} - \{I_j\}$. Then $|T^{clq}_{C' \wedge C''}| \leq 2(n - |C' \wedge C''|)$.

Fact 3.4 Let $C$ be a condition on a subset of $I^{clq}$ composed by atomic conditions
of the form $I_j = 1$ or $I_j \neq 0$. Then $|T^{clq}_C| \leq 2\binom{n - |C|}{2}$.



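We can now resume the proof of the theorem.

($\Rightarrow$) Let $C = \{v_{r_1}, \ldots, v_{r_k}\}$ be a clique of size $k$ in $G$. Consider the condition
$B = \bigwedge_{v_j \in (V - C)} (I_j = 1)$, and the condition $H = (I_{n+1} = 1)$. Clearly, $B \wedge H$ is non
trivial. By definition of clique, there exist $\frac{k(k-1)}{2}$ edges of $G$ connecting nodes in $C$.
Therefore, the cardinality of

$T' = \{t_{\{v_{r_x}, v_{r_y}\}}, t'_{\{v_{r_x}, v_{r_y}\}} \in T^{clq} \mid 1 \leq x < y \leq k\}$

equals $k(k - 1)$. Clearly $T' \subseteq T^{clq}_{B \wedge H}$, and $T^{clq}_{B \wedge H} = T^{clq}_B$, hence

$gain_\theta(B \Rightarrow H, T^{clq}) \geq s^{clq}$

($laplace_h(B \Rightarrow H, T^{clq}) \geq s^{clq}$ resp.).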

If T^;,^ -^^^ „ _ + 1^ (^jcZg^ yc/g^ laplacch, 71 - k + 1, s"^'') rcsp.) 
is a YES instance then there exists a non trivial rule B ^ H on J'^''' such that \B A H\ > 
71 - fc + 1 and gaine{B ^ H, T^'?) > s'^'^ (laplaceh{B ^ H, > s^'« resp.). 
We show that only conditions of the form Ij — 1 (or Ij ^ 0) can appear in B A H. First, 
we note that atomic conditions on numerical attributes of the form Ij e [0, 1] are trivial, 
while the same does not apply to categorical attributes. W.l.o.g. suppose fc > 3. By con- 
tradiction, suppose that there exists Ij = (or Ij ^ 1) occurring in B A H, then, by Fact 
3.3, \Tg^fj\ < 2(fc — 1). As gaing (laplacch resp.) increases when \Tbah\ increases 



and \Tb \ decreases, and is maximum for \Tbah\ = \Tb\, then 

2m + 1 

(laplaceniB ^ H,T^'i) < fg^^gi < resp.). 

We show that $I_{n+1} \in att(B \wedge H)$. By contradiction, suppose $I_{n+1} \notin att(B \wedge H)$. Then,
by Fact 3.4, $|T^{clq}_{B \wedge H}| < (k-1)(k-2)$, hence

$gain_\theta(B \Rightarrow H, T^{clq}) <$ […] $< s^{clq}$

($laplace_h(B \Rightarrow H, T^{clq}) <$ […] $< s^{clq}$ resp.).

Let $H' = (I_{n+1} = 1)$. We can obtain from $B \Rightarrow H$ an association rule $B' \Rightarrow H'$ such
that $gain_\theta(B' \Rightarrow H', T^{clq}) \ge s^{clq}$ ($laplace_h(B' \Rightarrow H', T^{clq}) \ge s^{clq}$ resp.). Simply take
as $B'$ the condition such that $B \wedge H$ equals $B' \wedge (I_{n+1} = 1)$ (or $B' \wedge (I_{n+1} \neq 0)$). We
note that $|B'| \ge n - k$.

As $|T^{clq}_{B' \wedge H'}| = |T^{clq}_{B'}|$, then $gain_\theta(B' \Rightarrow H', T^{clq}) \ge s^{clq}$ ($laplace_h(B' \Rightarrow H', T^{clq}) \ge
s^{clq}$ resp.) implies that $|T^{clq}_{B'}| \ge k(k-1)$. Thus $|B'| \le n - k$, and we have already
noticed that $|B'| \ge n - k$; hence the size of $B'$ is exactly $n - k$.

Let $I^{clq} - att(B') = \{I_{r_1}, \dots, I_{r_k}, I_{n+1}\}$. In order to have $|T^{clq}_{B'}| \ge k(k-1)$, $T^{clq}_{B'}$
contains, at least, the set $\{$ […] $\in T^{clq} \mid 1 \le x < y \le k\}$, i.e. the nodes
$v_{r_1}, \dots, v_{r_k}$ form a clique of $G$ having size $k$.

(Membership) Certificate: an association rule $B \Rightarrow H$ on a subset of $I$. Polynomial
checking: verify that $B \Rightarrow H$ is non trivial, that $|B \Rightarrow H| \ge k$, and that
$gain_\theta(B \Rightarrow H, T) \ge s$ ($laplace_h(B \Rightarrow H, T) \ge s$ resp.). □

Theorem 3.6 Let $T$ be a database with nulls. Then the complexity of $(I, T, p, k, s)$, with
$p \in \{gain_\theta, laplace_h\}$, is NP-complete.

Proof. (Sketch) The proof uses the same line of reasoning as in Theorem 3.5. However,
this time, we use $\epsilon$ values instead of $0$ values in the reduction. Furthermore, we note that
conditions of the form $I_j \neq 1$ imply that the value of gain and laplace is 0, hence only
conditions of the form $I_j = 1$ are admissible. □
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begin 

  for i := 1 to |T| do
    if |t_i| >= k then
      for guess := 1 to 2^|t_i| - 1 do
        if guess has exactly k bits set to 1 then begin
          count := 0;
          for j := 1 to |T| do
            if SATISFIES(t_j, guess, t_i) then count := count + 1;
          if count >= s|T| then return "yes";
        end;
  return "no";
end.



function SATISFIES(v, guess, u) : boolean;
begin
  p := 1;  { index of the current non null attribute of u }
  for q := 1 to |I| do
    if u[A_q] = c(A_q) then begin
      if guess[p] = 1 and v[A_q] = ε then return false;
      p := p + 1;
    end;
  return true;
end; { SATISFIES }



Figure 4: The algorithm of Theorem 4.1



4 Sparse databases 

There are many real applications having associated sparse databases. As an example,
consider a database of transactions from a large market stored for basket analysis purposes.
For databases showing this property, complexity figures are quite different from what we
have proved above.

Theorem 4.1 Let $T$ be a sparse database. Then the complexity of $(I, T, sup, k, s)$ is in
L.

Proof. We can build a Turing Machine employing $O(\log(\max\{|I|, |T|\}))$ space, which
decides $(I, T, sup, k, s)$.

Let $T = \{t_1, \dots, t_m\}$, and let $I = \{A_1, \dots, A_n\}$. Let guess be a (log-space)
counter, and let p be an integer; then guess[p] denotes the value of the p-th bit of guess.
The algorithm followed by the machine is depicted in Figure 4.

Roughly speaking, T considers each tuple ti, using the counter i, and tests only those 
conditions which can be built on ti. It is not necessary to represent each condition ex- 
plicitly; the counter guess is employed instead: the p-th bit of guess tells whether the 
p-th non null attribute value occurring in ti belongs to the current condition or not. Each 
guessed condition is then tested on each transaction tj of T, using the counter j. The 
counter count takes into account the number of tuples satisfying the current condition. 

It is straightforward to note that the space employed corresponds to the space needed
to store the variables count, p, q and guess. On the assumption that $T$ is sparse, i, j
and count need $\Theta(\log |T|)$ space, whereas p, q and guess need $O(\log |I|)$ space. Finally,
verifying if guess has at least k bits set to 1 can be easily done in logarithmic space. □
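The control flow of Figure 4 can be sketched in Python as follows. This is only an illustration of the enumeration, not a log-space implementation: it assumes a boolean database with nulls encoded as a list of equal-length tuples in which `None` stands for the null value, and the names `satisfies` and `sup_decision` are hypothetical. The threshold test uses `>=`, which is an assumption about the intended comparison.

```python
from fractions import Fraction

def satisfies(v, guess, u):
    """Mirror of SATISFIES: does v contain every attribute selected by `guess`
    among the non-null attributes of u? (None encodes the null value.)"""
    p = 0  # index of the current non-null attribute of u
    for q in range(len(u)):
        if u[q] is not None:
            if (guess >> p) & 1 and v[q] is None:
                return False
            p += 1
    return True

def sup_decision(T, k, s):
    """Return True iff some condition of exactly k attributes, all taken from
    a single tuple, is satisfied by at least s*|T| tuples."""
    for t_i in T:
        size = sum(x is not None for x in t_i)
        if size >= k:
            for guess in range(1, 2 ** size):
                if bin(guess).count("1") == k:
                    count = sum(satisfies(t_j, guess, t_i) for t_j in T)
                    if count >= s * len(T):
                        return True
    return False

# Toy database over three attributes; 1 marks a non-null value.
T = [(1, 1, None), (1, 1, 1), (None, 1, 1), (1, None, None)]
print(sup_decision(T, 2, Fraction(1, 2)))
```

Only conditions built from the non-null attributes of some tuple are tried, which is what keeps the number of guesses small when every tuple is short.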

Theorem 4.2 Let $T$ be a sparse database. Then the complexity of $(I, T, p, k, s)$, where
$p \in \{cnf, gain_\theta, laplace_h\}$, is in L.
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Proof. (Sketch). The proof follows the same line of reasoning of Theorem 4.1. In this 



case, two disjoint current conditions are needed (which represent the body and the head 
of the current association rule, respectively), and some further auxiliary counters using 
logarithmic space. □ 



5 Fixed schema complexity 

In this Section we improve the result reported in JT^, stating the polynomial-time solv- 
ability of the association rule mining problem under the fixed schema complexity mea- 
sure. For simplicity, we give only the proof regarding numerical attributes. 

Theorem 5.1 Let $I$ be a set of numerical attributes. Then the fixed schema complexity of
the problem $(I, T, sup, k, s)$ is in L.

Proof. Let $n = |I|$, and let $m = |T|$. We can build a Turing Machine employing
$O(\log m)$ space, which solves $(I, T, sup, k, s)$. The machine uses $2n$ pointers $p^l_j$, $p^u_j$ to $2n$ tuples
of $T$, of size $\Theta(\log m)$ each, and $2n$ bits $o_j$ and $i_j$, for each $j = 1, \dots, n$. An arrangement
of $T$ is a $4n$-tuple $(p^l_1, \dots, p^l_n, p^u_1, \dots, p^u_n, o_1, \dots, o_n, i_1, \dots, i_n) \in \{1, \dots, m\}^{2n} \times
\{0, 1\}^{2n}$. Let $t_i$ denote the $i$-th tuple of $T$; define $\theta(0)$ as "$\in$", $\theta(1)$ as "$\notin$", and $C_j$ as
$I_j \; \theta(o_j) \; [t_{p^l_j}[I_j], t_{p^u_j}[I_j]]$, for each $j = 1, \dots, n$.

The machine works as follows: it scans, one after the other, all the arrangements, and for each of them
performs the following steps: (1) Verifies that $i_1 + \dots + i_n = k$; (2) If step 1 succeeds,
verifies that $t_{p^l_j}[I_j] \neq \epsilon$ and $t_{p^u_j}[I_j] \neq \epsilon$, for each $j = 1, \dots, n$ such that $i_j = 1$; (3) If
step 2 succeeds, verifies that $t_{p^l_j}[I_j] \le t_{p^u_j}[I_j]$, for each $j = 1, \dots, n$ such that $i_j = 1$;
(4) If step 3 succeeds, verifies that the conditions $C_j$, for each $j = 1, \dots, n$ such that
$i_j = 1$, are non trivial; (5) If step 4 succeeds, verifies that $|T_{\bigwedge_{j : i_j = 1} C_j}| \ge s|T|$; (6) If step
5 succeeds, returns "yes" and stops.

If no arrangement passes step 5, the machine finally returns "no" and stops. We note that, to execute
steps 1-5, the Turing Machine needs an additional amount of space, to store counters and
auxiliary pointers, that is logarithmic w.r.t. the input size. It follows that the machine returns "yes"
iff $(I, T, sup, k, s)$ is a YES instance. □
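The enumeration of arrangements can be illustrated with a short Python sketch. It makes several assumptions that go beyond the text: tuples are Python tuples with `None` for the null value, interval endpoints are drawn from the values occurring in the database (the role played by the two pointers per attribute), a null value never satisfies a condition, and the non-triviality check of step 4 is omitted. The names `holds` and `fixed_schema_sup` are hypothetical, and the sketch does not attempt to respect the logarithmic space bound.

```python
from fractions import Fraction
from itertools import combinations, product

def holds(t, j, lo, hi, inside):
    """t satisfies 'I_j in [lo, hi]' (inside=True) or 'I_j not in [lo, hi]'
    (inside=False); a null value is assumed never to satisfy a condition."""
    if t[j] is None:
        return False
    return (lo <= t[j] <= hi) == inside

def fixed_schema_sup(T, k, s):
    n, m = len(T[0]), len(T)
    for attrs in combinations(range(n), k):            # step 1: exactly k attributes selected
        choices = []
        for j in attrs:
            # step 2: endpoints are non-null values occurring in the database
            vals = sorted({t[j] for t in T if t[j] is not None})
            # step 3: keep only well-formed intervals (lo <= hi), in both polarities
            choices.append([(lo, hi, ins) for lo in vals for hi in vals if lo <= hi
                            for ins in (True, False)])
        for cond in product(*choices):
            count = sum(all(holds(t, j, lo, hi, ins)
                            for j, (lo, hi, ins) in zip(attrs, cond)) for t in T)
            if count >= s * m:                          # step 5: support test
                return True
    return False

T = [(1, 10), (2, 20), (3, None)]
print(fixed_schema_sup(T, 2, Fraction(2, 3)))
```

With the number of attributes fixed, the number of candidate conditions is polynomial in the number of tuples, which is the point of the fixed schema measure.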

Theorem 5.2 The fixed schema complexity of the problems $(I, T, p, k, s)$, where $p \in
\{cnf, gain_\theta, laplace_h\}$, is in L.

Proof. (Sketch) The proof uses the same line of reasoning as in Theorem 5.1. □



6 Further complexity results 

In this section, we investigate the computational complexity of several interesting special 
cases of mining association rules. Most of them assume some parameters (e.g., the lower 
bound on the rule length k, the index value threshold s) of the general association rule 
mining problem to be fixed. The relevance of the analysis we present below is two-fold. 
First, it eases the task of detecting actual complexity sources. Second, from a practical 
point of view, users are often interested in solving such simplified tasks, as, for instance, 
when one wishes to mine only rules with a support always larger than 75 percent. 

As stated below, the rule mining problem remains very hard to solve whenever the 
support threshold is kept fixed. 

Theorem 6.1 The problem $(I, T, sup, k, s)$, where $s$ is a fixed constant in $(0,1)$ and $T$ is
a database with nulls, is NP-complete.
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Proof. Let / be a set of attributes /i, . . . , /„ defined on the domain {e,c}. Let T be a 
boolean database defined on $I$ and let $S$ be a subset of $I$. A tuple $t$ s.t. $t[J] = c$, for each
$J \in S$, and s.t. $t[J] = \epsilon$ otherwise, will be denoted in the following as $t = S$.

(Hardness) The proof is by reduction of CLIQUE. Let $G = (V, E)$ be an undirected graph,
with set of nodes $V = \{v_1, \dots, v_n\}$ and set of edges $E = \{(v_{p_1}, v_{q_1}), \dots, (v_{p_m}, v_{q_m})\}$.
We build a corresponding instance of $(I, T, sup, k, s)$ as follows: let $I^{clq}$ be the set
consisting of the attributes $I_1, \dots, I_n, I_{n+1}$, where $I_j$ represents the node $v_j$ of $G$, for
$j = 1, \dots, n$, and $I_{n+1}$ is an additional attribute. Let $T^{clq}$ be a set composed by the union
of the following sets of tuples:

- $T^E$, including the tuples $t_i = I^{clq} - \{I_{p_i}, I_{q_i}, I_{n+1}\}$, where $t_i$ represents the edge
$(v_{p_i}, v_{q_i})$ of $G$, for each $i = 1, \dots, m$;

- $T^0$, including $c_0$ copies of the tuple […], where $c_0$ is a value to be defined next;

- $T^1$, consisting of $c_1$ copies of the tuple $I^{clq} -$ […], where $c_1$ is a value to be
defined next.

As for the values $c_0$ and $c_1$ we choose two positive or null integer values such that

$$s = \frac{\frac{k(k-1)}{2} + c_1}{m + c_0 + c_1}.$$

It can be shown that two such values exist, and are both polynomially bounded in $m$. Indeed,
let $\alpha = k(k-1)/2$, and $s = ax/(bx)$: we have

$$\frac{ax}{bx} = \frac{\alpha + c_1}{m + c_0 + c_1}$$

where $a$, $b$ and $x$ are positive integers and $a < b$. Thus, $c_1 = ax - \alpha$ and $c_0 = bx - m -
(ax - \alpha)$. Setting $x$ equal to, e.g., $m + \alpha$, yields the two required values.
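As a quick arithmetic check of the padding values, the following Python sketch computes $c_0$ and $c_1$ for $x = m + \alpha$ and verifies that the resulting fraction equals $s$. It assumes the reading $c_1 = ax - \alpha$, $c_0 = bx - m - c_1$ with $\alpha = k(k-1)/2$ and $s = a/b$; the function name `padding_counts` is hypothetical.

```python
from fractions import Fraction

def padding_counts(k, m, s):
    """Return (c0, c1) such that (alpha + c1) / (m + c0 + c1) == s,
    where alpha = k(k-1)/2 and s = a/b with 0 < a < b."""
    alpha = k * (k - 1) // 2
    a, b = s.numerator, s.denominator
    assert 0 < a < b, "s must lie strictly between 0 and 1"
    x = m + alpha            # the choice of x suggested in the text
    c1 = a * x - alpha       # numerator:   alpha + c1 = a * x
    c0 = b * x - m - c1      # denominator: m + c0 + c1 = b * x
    return c0, c1

k, m, s = 3, 5, Fraction(3, 4)
c0, c1 = padding_counts(k, m, s)
alpha = k * (k - 1) // 2
print(c0, c1, Fraction(alpha + c1, m + c0 + c1) == s)
```

Both values are non-negative for this choice of $x$, since $c_1 = a(m + \alpha) - \alpha \ge m$ and $c_0 = (b - a)(m + \alpha) - m + \alpha \ge 2\alpha$.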

Next, we prove that there exists a clique of size $k$ in $G$ iff $(I^{clq}, T^{clq}, sup, n - k, s)$ is
a YES instance.

($\Rightarrow$) Let $C = \{v_{r_1}, \dots, v_{r_k}\}$ be a clique of size $k$ in $G$. Consider the condition

$B \wedge H = \bigwedge_{v_j \in (V - C)} (I_j = 1)$.

By definition of clique, there exist $k(k-1)/2$ edges of $G$ connecting nodes in $C$, i.e.
we can build a set $T' = \{(I^{clq} - \{I_{r_x}, I_{r_y}, I_{n+1}\}) \in T^E \mid 1 \le x < y \le k\}$ of $k(k-1)/2$
tuples. Clearly, $T' \subseteq T^{clq}_{B \wedge H}$. Thus $|T^{clq}_{B \wedge H}| \ge k(k-1)/2 + c_1$ and $sup(B \wedge H, T^{clq}) \ge s$.




($\Leftarrow$) W.l.o.g. suppose $n - k > 2$. By Proposition 3.2, if $(I^{clq}, T^{clq}, sup, n - k, s)$ is a YES
instance then there exists a rule $B \Rightarrow H$ of length $n - k$ and s.t. $|T^{clq}_{B \wedge H}| \ge k(k-1)/2 + c_1$.

Since $n - k > 2$, $B \wedge H$ cannot contain a condition $I_{n+1} = 1$. We have, indeed, that
$\forall J \in I^{clq}$ with $J \neq I_{n+1}$, $|T^{clq}_{(J = 1) \wedge (I_{n+1} = 1)}| = 0$. Let $Z = B \wedge H$.

Note that each transaction in $T^E$ has size $n - 3$ and no duplicate item exists. In order to
have $|T^{clq}_Z| \ge k(k-1)/2 + c_1$, $T^{clq}_Z$ contains, at least, the set

$T' = \{(I^{clq} - \{I_{r_x}, I_{r_y}, I_{n+1}\}) \in T^E \mid 1 \le x < y \le k\}$

i.e. the nodes $v_{r_1}, \dots, v_{r_k}$ form a clique of $G$ having size $k$.

(Membership) Certificate: a condition $C$. Polynomial checking: verify that $|C| \ge k$ and
$sup(C, T) \ge s$. □



Note that the special case $(I, T, sup, k, 1)$ can be easily shown to be in P.
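One way to see this for boolean databases with nulls is the following sketch, offered under stated assumptions rather than as the argument intended in the text: with threshold $s = 1$ every tuple must satisfy the condition, so only attributes that are non-null in every tuple can be used, and a condition of length at least $k$ exists iff there are at least $k$ such attributes. The encoding (`None` for null) and the name `sup_one_decision` are hypothetical, and non-triviality requirements are ignored.

```python
def sup_one_decision(T, k):
    """Decide (I, T, sup, k, 1) for a boolean database with nulls, assuming a
    tuple satisfies a condition iff it is non-null on all its attributes."""
    n = len(T[0])
    # attributes that are non-null in every tuple of the database
    common = [j for j in range(n) if all(t[j] is not None for t in T)]
    return len(common) >= k

T = [(1, 1, None), (1, 1, 1), (1, 1, None)]
print(sup_one_decision(T, 2), sup_one_decision(T, 3))
```

The check is a single pass over the database, hence polynomial in the input size.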
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Lemma 6.1 Let C be a condition on a set of boolean attributes. Then there exists a family 
$\{count(C)_{m,n}\}$ of $\#AC^0_2$ circuits computing $|T_C|$ over any input database $T$ defined
on a set of boolean attributes $I$ such that $att(C) \subseteq I$.

Proof. Let $att(C) \subseteq I = \{A_1, \dots, A_n\}$. We define the family $\{count(C)_{m,n}\}$ of
$\#AC^0_2$ circuits as follows. The circuit $count(C)_{m,n}$ has $m \times n$ binary inputs $x_{i,j}$,
$i = 1, \dots, m$, $j = 1, \dots, n$, with $m = |T|$ and $n = |I|$. The input $x_{i,j}$ is 1 if
$t_i[A_j] = c(A_j)$, 0 otherwise (i.e. if $t_i[A_j] = \epsilon$). The first level of $count(C)_{m,n}$
consists of $m$ $\times$-gates $G_i$, for $i = 1, \dots, m$. Each gate $G_i$ receives the $|C|$ inputs
$\{x_{i,k} \mid A_k \in att(C)\}$. Thus the output of $G_i$ is 1 iff $t_i \vdash C$. The second level of
$count(C)_{m,n}$ consists of a single $+$-gate receiving in input the outputs of all the $G_i$ gates,
for $i = 1, \dots, m$. Thus the circuit $count(C)_{m,n}$ calculates $|T_C|$ when the input has size
$m \times n$. □
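The two levels of the counting circuit can be mimicked directly in Python. In this sketch the database is assumed to be given as an $m \times n$ matrix of 0/1 entries (the inputs $x_{i,j}$), the condition is given by the list of column indices of $att(C)$, and the name `count_circuit` is hypothetical.

```python
from math import prod

def count_circuit(x, cond_attrs):
    """Evaluate the depth-2 arithmetic circuit: one product gate per tuple,
    followed by a single sum gate."""
    # level 1: gate G_i multiplies the inputs x[i][k] for A_k in att(C)
    g = [prod(row[k] for k in cond_attrs) for row in x]
    # level 2: the sum gate adds the outputs of all product gates
    return sum(g)

x = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
print(count_circuit(x, [0, 1]))  # number of tuples non-null on both A_1 and A_2
```

Each product gate outputs 1 exactly when its tuple satisfies the condition, so the sum is the number of satisfying tuples.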



The forthcoming Theorems (6.2, 6.3 and 6.4) associate some tasks related to mining
association rules to very low complexity classes such as $TC^0$ and $AC^0$. It turns out that
these problems are highly parallelizable (recall that $AC^0 \subseteq TC^0 \subseteq NC^1$ [^]).



Theorem 6.2 Let $I$ be a set of boolean attributes, and let $k$ be a fixed constant. Then the
complexity of $(I, T, sup, k, s)$ is in $TC^0$.

Proof. Let $s$ be codified as a pair of naturals $(a, b)$ such that $s = a/b$, and let $C$ be a
condition on a subset of $I$. Consider the function $f(C, T, s) = (b|T_C| + 1) - a|T|$ taking
value over integers. Let $B \Rightarrow H$ be an association rule on $I$, and let $I_R$ be the set
$att(B \wedge H)$. Clearly, $sup(B \Rightarrow H, T) \ge s$ iff $f(B \wedge H, T, s) > 0$.
We recall the following result [^]: for each integer $N$ there exists a log-time uniform
$\#AC^0$ circuit, which computes $N$, when the binary representation of $N$ is given in input.
Call this circuit $number(N)$. Since $a$ and $b$ are integers, we can build two $\#AC^0$ circuits
computing the functions $b|T_C|$ and $a|T| = am$, connecting $number(b)$ to $count(C)_{m,n}$
and $number(a)$ to $number(m)$ through a $\times$-gate, respectively.

Then, the function $f(C, T, s)$ is in the class $GapAC^0$, and the language $\{B \Rightarrow H$ on $I \mid
sup(B \Rightarrow H, T) \ge s\}$ is in the class $PAC^0$, which coincides with $TC^0$ under log-space
uniformity [^]. Thus, there exists a constant-depth polynomial size uniform family
$\{C'(I_R)_{m,n}\}$ of circuits of unbounded fan-in AND, OR and MAJORITY gates, such that
$C'(I_R)_{m,n}$ outputs 1 iff $sup(B \Rightarrow H, T) \ge s$, when the input database has size $m \times n$.
We can build a $TC^0$ family of circuits solving the $(I, T, sup, k, s)$ problem when $k$ is fixed
as follows. Consider the circuit $C(I)_{m,n}$ obtained connecting the outputs of all the
circuits $C'(I_R)_{m,n}$, with $I_R \subseteq I$ such that $|I_R| = k$, through an OR gate. Since the number
of these circuits is $\binom{|I|}{k} = O(|I|^k)$, hence polynomial, $C(I)_{m,n}$ has constant depth and
polynomial size as well. The result then follows from Proposition 3.2. □
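The integer function $f$ and the final OR over all attribute sets of size $k$ can be illustrated numerically. The sketch below is not a circuit construction; it only checks the arithmetic, assuming a 0/1 matrix encoding of the database and reading the support test as $|T_C| / |T| \ge s$. The names `f` and `decide` are chosen for illustration.

```python
from fractions import Fraction
from itertools import combinations

def f(x, attrs, s):
    """f(C, T, s) = (b*|T_C| + 1) - a*|T| with s = a/b; positive iff sup >= s."""
    a, b = s.numerator, s.denominator
    m = len(x)
    t_c = sum(all(row[j] for j in attrs) for row in x)  # |T_C|
    return (b * t_c + 1) - a * m

def decide(x, k, s):
    """OR over all attribute sets of size exactly k of the test f > 0."""
    n = len(x[0])
    return any(f(x, attrs, s) > 0 for attrs in combinations(range(n), k))

x = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1],
     [1, 0, 0]]
print(f(x, (0, 1), Fraction(1, 2)), f(x, (0, 1), Fraction(3, 4)))
print(decide(x, 2, Fraction(1, 2)), decide(x, 2, Fraction(3, 4)))
```

Adding 1 before subtracting turns the non-strict comparison $b|T_C| \ge a|T|$ into a strict positivity test on integers, which is the form required by the $PAC^0$ characterization.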



It is of interest to investigate the complexity of mining association rules when the value
$\lceil s|T| \rceil$ is fixed. In this case $(I, T, sup, k, s)$ corresponds to the problem of finding an
association rule satisfied by at least a fixed number of transactions. Such a problem becomes
of relevance when it is necessary to find a fixed size set of transactions satisfying a certain
property (e.g. in statistical sampling, see [^]).

Definition 6.1 Given a set of boolean attributes $I = \{A_1, \dots, A_n\}$, and a database $T =
\{t_1, \dots, t_m\}$ defined on $I$, we define $(I, T)^{-1}$ to be equal to the pair $(I', T')$, where
$I' = \{A'_1, \dots, A'_m\}$ is a set of boolean attributes, where each $A'_j$ denotes the $j$-th tuple of
$T$, for $j = 1, \dots, m$, and $T' = \{t'_1, \dots, t'_n\}$ is a database defined on $I'$, with $t'_i$,
corresponding to the $i$-th attribute of $I$, such that $t'_i[A'_j] = 1$ if $t_j[A_i] = c(A_i)$, and
$t'_i[A'_j] = \epsilon$ otherwise (i.e. if $t_j[A_i] = \epsilon$), for $i = 1, \dots, n$, $j = 1, \dots, m$.

Note that here and elsewhere, by a little abuse of notation, for simplicity, we denote a circuit family
recognizing inputs in the form of an $m \times n$ boolean matrix by using the subscript $m, n$ instead of one
single subscript specification denoting the input size.
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Proposition 6.1 Let be J a set of boolean attributes, let T be a database on /, let k 
be a natural number, $1 \le k \le |I|$, let $s$, $0 < s \le 1$, be a rational number, and let
$(I', T') = (I, T)^{-1}$. Then:

$(I, T, sup, k, s)$ is a YES instance $\iff$ $(I', T', sup, \lceil s|T| \rceil, \frac{k}{|T'|})$ is a YES instance (1)

Proof. $(I, T, sup, k, s)$ is a YES instance iff there exists an association rule $B \Rightarrow H$ on
$I$ s.t. $|B \Rightarrow H| \ge k$ and $|T_{B \wedge H}| \ge \lceil s|T| \rceil$, iff there exists an association rule $B' \Rightarrow H'$
on $I'$ s.t. $|B' \Rightarrow H'| \ge \lceil s|T| \rceil$ and $|T'_{B' \wedge H'}| \ge k$, iff $(I', T', sup, \lceil s|T| \rceil, \frac{k}{|T'|})$ is a YES
instance. □
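The exchange of roles between rule length and support count can be checked by brute force on small inputs. The sketch below assumes a 0/1 matrix encoding of a boolean database, restricts attention to rules of length exactly $k$, and takes the threshold of the transposed instance to be $k / |T'|$, which is an assumption about the damaged formula; the names `decide` and `transposed_instance` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil

def decide(x, k, s):
    """Brute force: is there a set of k attributes on which at least
    ceil(s * m) tuples are all non-null?"""
    m, n = len(x), len(x[0])
    need = ceil(s * m)
    return any(sum(all(row[j] for j in attrs) for row in x) >= need
               for attrs in combinations(range(n), k))

def transposed_instance(x, k, s):
    """Tuples become attributes and vice versa; k and ceil(s*|T|) swap roles."""
    m, n = len(x), len(x[0])
    xt = [list(col) for col in zip(*x)]
    return xt, ceil(s * m), Fraction(k, n)

x = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1],
     [1, 0, 0]]
xt, k2, s2 = transposed_instance(x, 2, Fraction(1, 2))
print(decide(x, 2, Fraction(1, 2)), decide(xt, k2, s2))
```

Both instances ask for the same object, an all-ones submatrix with $\lceil s|T| \rceil$ rows and $k$ columns, seen from the two sides of the matrix.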

Theorem 6.3 Let $I$ be a set of boolean attributes, and let $\lceil s|T| \rceil$ be a fixed constant. Then
the complexity of $(I, T, sup, k, s)$ is in $TC^0$.

Proof. The result follows immediately from Theorem 6.2 and Proposition 6.1. □



Theorem 6.4 Let $I$ be a set of boolean attributes, and let $k$ and $\lceil s|T| \rceil$ be two fixed
constants. Then the complexity of $(I, T, sup, k, s)$ is in $AC^0_2$.

Proof. Let $I = \{A_1, \dots, A_n\}$, and let $T = \{t_1, \dots, t_m\}$. Let $B \Rightarrow H$ be an association
rule on $I$, and let $I_R$ be the set $att(B \wedge H)$. Define the family $\{C'(I_R)_{m,n}\}$ of $AC^0$
circuits as follows.

The circuit $C'(I_R)_{m,n}$ has $n \times m$ binary inputs $x_{i,j}$, $i = 1, \dots, m$, $j = 1, \dots, n$,
with $m = |T|$ and $n = |I|$. The input $x_{i,j}$ is 1 if $t_i[A_j] = c(A_j)$, 0 otherwise (i.e. if
$t_i[A_j] = \epsilon$). The first level of $C'(I_R)_{m,n}$ consists of $m$ AND gates $G^1_i$, for $i = 1, \dots, m$.
Each gate $G^1_i$ receives the inputs $\{x_{i,k} \mid A_k \in I_R\}$.

Thus the output of $G^1_i$ is 1 iff $t_i \vdash (B \wedge H)$. The second level of $C'(I_R)_{m,n}$ consists
of $\binom{m}{\lceil sm \rceil}$ AND gates $G^2_j$, for $j = 1, \dots, \binom{m}{\lceil sm \rceil}$, where

$g = \{F \subseteq \{G^1_1, \dots, G^1_m\} : |F| = \lceil sm \rceil\}$

The gate $G^2_j$ receives in input the outputs of the $\lceil sm \rceil$ gates contained within the $j$-th
element of $g$.

The third level consists of a single OR gate receiving in input the outputs of all the
$G^2_j$ gates, for $j = 1, \dots, \binom{m}{\lceil sm \rceil}$. Thus the circuit $C'(I_R)_{m,n}$ decides if $|T_{B \wedge H}| \ge \lceil sm \rceil$.

The size of each circuit $C'(I_R)_{m,n}$ is polynomial, since $|g| \le m^{\lceil sm \rceil}$, and $\lceil sm \rceil$ is fixed.
We can build an $AC^0$ circuit solving $(I, T, sup, k, s)$, for $k$ and $\lceil s|T| \rceil$ fixed, as
follows. Consider the circuit $C(I)_{m,n}$ obtained connecting the outputs of all the circuits
$C'(I_R)_{m,n}$, with $I_R \subseteq I$ such that $|I_R| = k$ (it suffices from Proposition 3.2), through an
OR gate.

Since the number of these circuits is $\binom{|I|}{k} = O(|I|^k)$, hence polynomial, $C(I)_{m,n}$ has
constant depth and polynomial size as well. The first and second level (of AND gates),
and the third and fourth level (of OR gates), can be easily each reorganized into a single
level, thus giving an overall circuit family of depth 2. Hence the result follows. □
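The threshold test built from AND and OR gates only can be simulated as below. This sketch assumes a 0/1 matrix encoding of the database and writes `r` for the fixed value $\lceil sm \rceil$; the names `at_least` and `rule_circuit` are hypothetical.

```python
from itertools import combinations

def at_least(bits, r):
    """OR over all r-element subsets F of an AND of the bits in F:
    true iff at least r of the bits are 1."""
    return any(all(bits[i] for i in F)
               for F in combinations(range(len(bits)), r))

def rule_circuit(x, attrs, r):
    """Levels of C'(I_R): per-tuple AND gates, then the AND/OR threshold test."""
    g1 = [all(row[j] for j in attrs) for row in x]  # level 1: t_i satisfies B and H
    return at_least(g1, r)                          # levels 2 and 3

x = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1],
     [1, 0, 0]]
print(rule_circuit(x, (0, 1), 2), rule_circuit(x, (0, 1), 3))
```

The number of subsets enumerated is $\binom{m}{r}$, which is polynomial in $m$ only because `r` is a fixed constant; this is where the hypothesis of the theorem is used.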



7 Conclusions 

In this paper, we have analyzed the computational complexity of mining association rules. 
We have considered the most widely accepted form of association rules that use well- 
known quality indices, namely, support, confidence, gain and laplace. After having for- 
mally defined association rule mining problems, we have shown that the general versions 
of these problems are NP-complete, except when confidence is measured on databases
without nulls.
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Then, we have focused on analyzing several interesting restricted cases, for most of 
which lower complexity bounds have been proved to hold. It is relevant to note that these 
cases are often related to complexity classes for which the existence of highly parallelizable
algorithms has been proved. For example, for sparse databases, the complexity of
the mining problem lies within L. In some other analyzed cases, where some of the
parameters of the mining problems are considered as fixed constants, the mining problem
lies in $TC^0$ or in $AC^0$.

The complexity analysis presented in this paper is not complete, though. For instance,
it is relevant to analyze the complexity induced by adopting other indices such as
entropy and improvement [^] [^]. Moreover, other forms of association rules could be
considered as, for instance, sequential patterns. We leave these topics to future research.
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